Walking style, or gait, is known to differ between individuals and to be fairly stable over time, whereas deliberately imitating another person's walking style is difficult. Monitoring these movements can therefore be used, like a fingerprint or retinal scan, to identify or verify an individual.

Typically, vision-based methods are used for gait recognition. In this work we try to identify a person using the acceleration signals from a normal walk. In the past, collecting sensor data was expensive, but now, thanks to personal tracking devices such as smartphones, these data are much easier to obtain.

Data Collection

To collect data, we used the Arduino Science Journal app to record from the phone's triaxial accelerometer. The data include the acceleration along the x-, y-, and z-axes, which capture the user's horizontal movement (x-axis), forward/backward movement (y-axis), and upward/downward movement (z-axis).

Each subject (the four members of our group) carried the phone in their right pants pocket.

Each individual generated a .csv file containing the relative time (in milliseconds) and the acceleration along the three axes.

Data wrangling

For each dataset, we removed the first and last minute of the recording, which correspond to the time needed to position the sensor and to stop the recording. Furthermore, we transformed the relative time variable from milliseconds to seconds.
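This trimming step can be sketched as follows. The original analysis was done in R; this is an illustrative Python/pandas version, and the `relative_time` column name is an assumption about the exported file:

```python
import pandas as pd

def clean_recording(df: pd.DataFrame, trim_s: float = 60.0) -> pd.DataFrame:
    """Convert relative time from milliseconds to seconds, then drop the
    first and last `trim_s` seconds of the recording.

    Assumes a 'relative_time' column in milliseconds (name is illustrative).
    """
    out = df.copy()
    out["relative_time"] = out["relative_time"] / 1000.0  # ms -> s
    start, end = out["relative_time"].min(), out["relative_time"].max()
    keep = (out["relative_time"] >= start + trim_s) & (
        out["relative_time"] <= end - trim_s
    )
    return out.loc[keep].reset_index(drop=True)
```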

EDA

In this step, we produced several plots to understand the behavior of the data.


Feature engineering

To extract features from these data, we used windowing with overlap. Within each window we extracted the following statistics:

  • mean
  • standard deviation (sd)
  • mean absolute deviation (absd)
  • root mean square (rms)

We computed these statistics on each of the three axes, obtaining 12 features.

Windowing

A straightforward data preparation approach for classical machine learning methods is to divide the input signal into windows, where a given window holds one to a few seconds of observations. This is often called a sliding window.

With this technique we divide the data into smaller sets of the same size. Individual windows may overlap in time in order to reduce the loss of information at the window edges; the overlap between windows is set to 50% by default.

The window size is an important parameter to choose; for this reason, we ran a grid search over it to find the best-performing size.
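The windowing plus feature-extraction step can be sketched as follows. This is an illustrative Python/NumPy version, not the original code; the four statistics correspond to the feature names (mean_*, sd_*, absd_*, rms_*) used in the analysis:

```python
import numpy as np

def window_features(acc: np.ndarray, win: int = 20, overlap: float = 0.5) -> np.ndarray:
    """Slice a (n_samples, 3) acceleration signal into overlapping windows
    and compute mean, sd, mean absolute deviation and RMS per axis.

    Returns an array of shape (n_windows, 12): four statistics x three axes.
    """
    step = max(1, int(win * (1.0 - overlap)))  # 50% overlap by default
    feats = []
    for start in range(0, acc.shape[0] - win + 1, step):
        w = acc[start:start + win]                # one window, shape (win, 3)
        mean = w.mean(axis=0)
        sd = w.std(axis=0, ddof=1)
        absd = np.abs(w - mean).mean(axis=0)      # mean absolute deviation
        rms = np.sqrt((w ** 2).mean(axis=0))      # root mean square
        feats.append(np.concatenate([mean, sd, absd, rms]))
    return np.asarray(feats)
```

With the 1.28 s window chosen later (20 samples) and 50% overlap, consecutive windows share 10 samples.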

Model training

Several classifiers were considered to solve this multi-class problem.

First, we split the dataset into train (80%) and test (20%) sets. Then we used caret to train the models and find the best parameters for K-NN and SVM, using 5-fold cross-validation. For the Random Forest, mtry (the number of variables randomly sampled as candidates at each split) is set to \(\sqrt{\text{number of features}}\), while the number of trees is set to 500.
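As a rough Python counterpart to this caret workflow (the original analysis used R; the data, parameter grid, and seeds below are purely illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: replace with the 12 window features and subject labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 12))
y = rng.integers(0, 4, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# K-NN tuned with 5-fold CV, analogous to caret's train().
knn = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9]},
    cv=5,
).fit(X_tr, y_tr)

# Random Forest: 500 trees, sqrt(p) variables tried at each split (mtry).
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
```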

Window size

From this plot it is clear that a larger window size gives better performance. In particular, the accuracy increases up to 1 second, then stabilizes around 0.99.

In the end, we decided to use a window of 1.28 seconds, since a window of this length can adequately represent cycles in walking activity (Reference). Each window contains 20 samples.

Let's explore some of the created features visually.

As we can see, along the x-axis the mean acceleration of subjects 1 and 2 is always around 0, while subject 3 shows only negative mean values and subject 4 only positive ones.

Along the y-axis, the acceleration of all subjects is around -10, but with different variability. The z-axis shows the same behavior.

Comparing models

method     accuracy
Logistic  0.9823859
LDA       0.9759808
KNN       0.9767814
SVM       0.9807846
RF        0.9831865

The performances of the methods are very similar to each other; we choose the Random Forest, which is the best one. Let's look at the confusion matrix in detail.

The accuracy is 0.9856

IML

Machine learning models usually perform very well for prediction, but are hard to interpret. We decided to apply different techniques to explain the Random Forest model.

Features Importance


As we can see from the plot, the most important features are:

  • Mean Accelerometer z-axis
  • Mean Accelerometer x-axis
  • other statistics computed on the z-axis accelerometer
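A ranking like the one in the importance plot can be obtained, for example, with permutation importance (the accuracy drop when a single feature column is shuffled). This is a hedged Python sketch on synthetic data, not the original computation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data with one informative feature, so the ranking is non-trivial.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))
y = (X[:, 0] > 0).astype(int)  # class depends only on feature 0

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Permutation importance: shuffle one column at a time, measure score drop.
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
```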

Now, we can go deeper into the analysis of these features. Let’s see the ALE Plot and Shapley Values Plot.


ALE Plot


The Mean Acceleration on the z-axis is more relevant for the classification of classes 1 and 3: as we can see from the plot, these classes show a greater variation than the other two. The same happens for the Standard Deviation on the same axis.

The same holds for the Mean Acceleration on the x-axis, which is more relevant for classes 2 and 4.


Shapley Values Plot


We saw that only a few variables carry most of the importance. For this reason, we can try to train a new model using a smaller number of variables.

The features are:

  • mean_Accz
  • mean_Accx
  • sd_Accz
  • absd_Accz
  • rms_Accz

The accuracy is 0.9824

Using fewer features, the performance is almost the same.


Clustering

At this point, we wanted to try another approach: we removed the target variable and used an unsupervised method to identify the different classes. To find the best number of clusters, we computed the Total Within Sum of Squares for different values of K.

Looking at the plot, we suppose that the “elbow” is around K = 2 and K = 4. Let's also look at the Average Silhouette plot.


The Average Silhouette Width peaks at K = 2 and K = 4. At this point, we computed K-Means for these values of K.
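The elbow and silhouette analysis can be sketched as follows (a Python/scikit-learn illustration on synthetic blob data; the original analysis was done in R):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs, so K = 3 should win.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in (0.0, 5.0, 10.0)])

inertia, sil = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia[k] = km.inertia_            # Total Within Sum of Squares (elbow)
    sil[k] = silhouette_score(X, km.labels_)

best_k = max(sil, key=sil.get)          # K with highest average silhouette
```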


Silhouette Plot

For K = 2:

  cluster size ave.sil.width
1       1 3815          0.53
2       2 2437          0.28

For K = 4:

  cluster size ave.sil.width
1       1 1073          0.24
2       2 1726          0.33
3       3 1748          0.28
4       4 1705          0.51

At this point, we made a plot to see how the different clusters are distributed with respect to the mean accelerometer value on each of the three axes.

For K = 2, the clustering is able to distinguish two different groups with respect to the x-axis.


For K = 4, we have a good separation between clusters 2 and 4, while it is difficult to distinguish between clusters 1 and 3. From the clusters above, we see that the primary difference in walking styles between people lies in the acceleration along the x- and z-axes, while the y-axis acceleration is about the same for everyone; this is consistent with what we saw in the IML section.


Given that we know the ground truth, we compared the clusters obtained with the original labels.

The clustering is able to identify the different subjects, but there is a little confusion in the separation between subjects 1 and 3. Furthermore, it does not identify well the set of samples that lies far from the central part of the distribution.


This situation is clearer when looking at the confusion matrix between the true labels and the discovered clusters.
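Computing an accuracy between clusters and true labels requires first matching each cluster to a label. One standard way is an optimal assignment on the confusion matrix; this matching method is our assumption for illustration, not necessarily what was done here:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels: np.ndarray, cluster_ids: np.ndarray) -> float:
    """Best-case accuracy after optimally matching clusters to labels
    (Hungarian algorithm on the cluster-vs-label confusion matrix)."""
    labels = np.unique(true_labels)
    clusters = np.unique(cluster_ids)
    # Confusion matrix: rows = clusters, columns = true labels.
    cm = np.zeros((len(clusters), len(labels)), dtype=int)
    for i, c in enumerate(clusters):
        for j, l in enumerate(labels):
            cm[i, j] = np.sum((cluster_ids == c) & (true_labels == l))
    row, col = linear_sum_assignment(-cm)  # negate to maximize matched counts
    return cm[row, col].sum() / len(true_labels)
```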

The accuracy is 0.7182

Conclusions

The performance is excellent for all models; the best result is obtained with the Random Forest. The window size matters for the accuracy: in particular, we noted that with window sizes above 0.7 seconds the score exceeded 0.90, and beyond 1 second the performances were very similar to each other, with good results.

Future Works

  • Use a low-pass filter to separate the linear acceleration due to body motion from the acceleration due to gravity.

  • Dynamic window-size allocation
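The first idea above could be sketched with a Butterworth low-pass filter; the cutoff frequency and filter order below are illustrative choices, not values from this work:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def split_gravity(acc: np.ndarray, fs: float, cutoff: float = 0.3, order: int = 3):
    """Separate gravity (low-frequency component) from body motion with a
    zero-phase Butterworth low-pass filter.

    acc: (n_samples,) or (n_samples, 3) signal; fs: sampling rate in Hz.
    """
    b, a = butter(order, cutoff / (fs / 2.0), btype="low")
    gravity = filtfilt(b, a, acc, axis=0)  # slow component ~= gravity
    body = acc - gravity                   # residual ~= body motion
    return gravity, body
```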